Load-balancing algorithms for data center networks

ABSTRACT

Multipath load-balancing algorithms, which can be used for data center networks (DCNs), are provided. A multipath load-balancing algorithm can be, for example, a distributed multipath load-balancing algorithm or a centralized multipath load-balancing algorithm. Algorithms of the subject invention can be used for, e.g., hierarchical DCNs and/or fat-tree DCNs. Algorithms of the subject invention are effective and scalable and significantly outperform existing solutions.

CROSS REFERENCE TO RELATED APPLICATION

This application is a continuation application of U.S. application Ser. No. 14/297,125, filed Jun. 5, 2014, which claims the benefit of U.S. Provisional Application Ser. No. 61/832,458, filed Jun. 7, 2013, the disclosures of which are all hereby incorporated by reference in their entireties, including any figures, tables, and drawings.

BACKGROUND OF INVENTION

Data centers contain large numbers of servers to achieve economies of scale [19], and the number is increasing exponentially [13]. For example, it is estimated that Microsoft's Chicago data center has about 300,000 servers [1]. The huge number of servers has created a challenge for the data center network (DCN) to offer proportionally large bandwidth to interconnect the servers [30]. As a result, modern DCNs usually adopt multi-rooted hierarchical topologies, such as the fat tree [2], VL2 [11], DCell [13], and BCube [12], which offer multipath capability for large bisection bandwidth and increased fault tolerance. For example, FIG. 1 shows a diagram of a hierarchical fat tree topology. The topology in FIG. 1 has four layers, from bottom to top: hosts; edge switches; aggregation switches; and core switches, with the four core switches acting as the multiple roots of the network. As a result, there are two different paths between hosts A and B, as shown in different colors (green and red).

However, traditional link state and distance vector based [16] routing algorithms (e.g., for the Internet) cannot readily utilize the multipath capability of multi-rooted topologies. Traditional routing algorithms calculate routes based only on packet destinations, and thus all packets to the same destination share the same route. Although equal cost multipath (ECMP) [9] supports multipath routing, it performs static load-splitting based on packet headers without accounting for bandwidth, allows only paths of the same minimum cost, and supports an insufficiently small number of paths [14]. Further, traditional routing algorithms usually give preference to the shortest path to reduce propagation delay. Due to small geographical distances, DCNs are less concerned about propagation delay and instead give priority to bandwidth utilization.

Typical DCNs offer multiple routing paths for increased bandwidth and fault tolerance. Multipath routing can reduce congestion by taking advantage of the path diversity in DCNs. Typical layer-two forwarding uses a spanning tree, in which there is only one path between source and destination nodes. A recent work provides multipath forwarding by computing a set of paths that exploits the redundancy in a given network, and merges these paths into a set of trees, each mapped as a separate VLAN [19]. At layer three, equal cost multipath (ECMP) [9] provides multipath forwarding by performing static load splitting among flows. ECMP-enabled switches are configured with several possible forwarding paths for a given subnet. When a packet arrives at a switch with multiple candidate paths, the switch forwards it on the path that corresponds to a hash of selected fields of the packet header, thus splitting the load to each subnet across multiple paths. However, ECMP does not account for flow bandwidth in making allocation decisions, which may lead to oversubscription even for simple communication patterns. Further, current ECMP implementations limit the multiplicity of paths to 8-16, which is fewer than what would be required to deliver high bisection bandwidth for larger data centers [14].

There exist multipath solutions for DCNs, including Global First-Fit and Simulated Annealing [4]. The former simply selects among all the possible paths that can accommodate a flow, but needs to maintain all paths between a pair of nodes. The latter performs a probabilistic search of the optimal path, but converges slowly. The ElasticTree DCN power manager uses two multipath algorithms, Greedy Bin-Packing and Topology-Aware Heuristic [14]. The former evaluates possible paths and chooses the leftmost one with sufficient capacity. The latter is a fast heuristic based on the topological feature of fat trees, but with the impractical assumption that a flow can be split among multiple paths. The MicroTE framework supports multipath routing, coordinated scheduling of traffic, and short-term traffic predictability [5].

BRIEF SUMMARY

Embodiments of the subject invention provide multipath load-balancing algorithms, which can be used for data center networks (DCNs). In several embodiments, a multipath load-balancing algorithm is a distributed multipath load-balancing algorithm. In several embodiments, a multipath load-balancing algorithm is a centralized multipath load-balancing algorithm. Algorithms of the subject invention can be used for, e.g., hierarchical DCNs and/or fat-tree DCNs. Algorithms of the subject invention advantageously significantly outperform existing solutions. In addition, the designs of algorithms of the subject invention are effective and scalable.

In an embodiment, a method of load balancing in a network can include: receiving, by a switch, a packet; looking up, by the switch, a packet header of the packet to check whether the packet belongs to an existing flow; if the packet belongs to an existing flow, forwarding the packet based on information in a flow table of the switch, and otherwise, creating a new entry in the flow table for the packet and calculating the next hop; determining if the next hop is an upstream or downstream layer of the network based on a destination IP address; and comparing load values of links to the next layer and selecting a worst-fit link. Such a method can be an example of a distributed multipath load-balancing algorithm.

In another embodiment, a method of load balancing in a network can include: checking which layer of the network a packet should go through based on locations of a source host of the network and a destination host of the network; determining, by a central controller, a bottleneck link of each potential path corresponding to a different connecting layer switch; comparing, by the controller, the available bandwidth of all the potential paths and finding the path with the maximum bandwidth; and if the maximum bandwidth is greater than a demand of a flow of the network, then selecting the corresponding path for the flow, and otherwise, determining that no viable path exists for the packet. Such a method can be an example of a centralized multipath load-balancing algorithm.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 shows a diagram of a hierarchical fat tree topology.

FIG. 2 shows a diagram of a single pod from a 4-pod fat tree topology.

FIG. 3 shows a plot of average packet delay for different routing algorithms for non-uniform traffic.

FIG. 4 shows a plot of average packet delay for different routing algorithms for uniform traffic.

FIG. 5 shows a plot of average network throughput for different routing algorithms for non-uniform traffic.

FIG. 6 shows a plot of average network throughput for different routing algorithms for uniform traffic.

DETAILED DISCLOSURE

Embodiments of the subject invention provide multipath load-balancing algorithms, which can be used for data center networks (DCNs). In several embodiments, a multipath load-balancing algorithm is a distributed multipath load-balancing algorithm. In several embodiments, a multipath load-balancing algorithm is a centralized multipath load-balancing algorithm. Algorithms of the subject invention can be used for, e.g., hierarchical DCNs and/or fat-tree DCNs. Algorithms of the subject invention advantageously significantly outperform existing solutions, such as benchmark algorithms. In addition, the designs of algorithms of the subject invention are effective and scalable.

Embodiments of the subject invention provide load-balancing algorithms that enable full bandwidth utilization and efficient packet scheduling. In several embodiments, an algorithm is a centralized multipath load-balancing algorithm. The route can be determined by selecting the top-layer switch with the minimum bottleneck link load. In an embodiment, the first step is to determine the top layer needed to connect the source and destination hosts. Hosts in a fat-tree network typically have internet protocol (IP) addresses corresponding to their topological locations. Determining the top layer avoids wasting bandwidth of switches at higher layers, which remains available for future flows. The second step is then to compare the bottleneck link loads of the candidate paths via different top-layer switches. The design is based on the observation that there is only a single path from a top-layer switch to any host. Therefore, for a specific top-layer switch, the path from the source to the destination is determined. By selecting the path with the minimum bottleneck load, the algorithm achieves the load-balancing objective.
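
As an illustration of this first step, the following sketch shows how the connecting layer can be inferred from topologically assigned addresses. It assumes fat-tree style host addresses of the form 10.pod.edge.host (in the spirit of the addressing schemes of [2] and [20]); the function name and address layout are illustrative assumptions, not part of the claimed algorithm.

```python
def connecting_layer(src_ip: str, dst_ip: str) -> str:
    """Return the topmost layer a path between two hosts must reach,
    assuming addresses of the form 10.pod.edge.host."""
    _, src_pod, src_edge, _ = (int(x) for x in src_ip.split("."))
    _, dst_pod, dst_edge, _ = (int(x) for x in dst_ip.split("."))
    if src_pod != dst_pod:
        return "core"         # different pods: the path must cross a core switch
    if src_edge != dst_edge:
        return "aggregation"  # same pod, different edge switches
    return "edge"             # the common edge switch connects both hosts

# Example: connecting_layer("10.0.1.2", "10.2.0.2") returns "core".
```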

In an embodiment, the algorithm can be implemented in a fat-tree network leveraging flow protocols such as OpenFlow, and supports efficient routing in commercial data centers. Centralized multipath load-balancing algorithms of the subject invention help achieve high network throughput and short packet delay.

In several embodiments, an algorithm is a distributed multipath load-balancing algorithm. Depth-first search can be used to find a sequence of worst-fit links to connect the source and destination of a flow. Because DCN topologies are typically hierarchical, the depth-first search can quickly traverse between the hierarchical layers of switches to find a path. When there are multiple links to the neighboring layer, the worst-fit criterion always selects the one with the largest amount of remaining bandwidth. By using the max-heap data structure, worst-fit can make the selection decision with constant time complexity. Further, worst-fit achieves load balancing, and therefore avoids unnecessary backtracking in the depth-first search and reduces packet queuing delay. The distributed nature of the algorithm can guarantee scalability.
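
The following is a minimal sketch of this search, assuming two helpers not specified in the original text: `next_hops(node, dst)`, which yields the candidate next-layer neighbors for the flow, and a dictionary `avail` holding the remaining bandwidth of each directed link. It is illustrative only, not the patented implementation.

```python
def dfs_worst_fit(node, dst, demand, avail, next_hops, path=()):
    """Depth-first search that tries worst-fit (largest remaining bandwidth)
    links first; returns a node list from `node` to `dst`, or None."""
    if node == dst:
        return list(path) + [dst]
    # Worst-fit order: the neighbor reached over the emptiest link is tried first.
    for nxt in sorted(next_hops(node, dst),
                      key=lambda n: avail[(node, n)], reverse=True):
        if avail[(node, nxt)] < demand:
            break  # remaining candidates have even less bandwidth
        found = dfs_worst_fit(nxt, dst, demand, avail, next_hops, path + (node,))
        if found is not None:
            return found
    return None  # this layer is exhausted; the caller backtracks
```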

In an embodiment, the algorithm can be conveniently implemented in switches and routers, and supports efficient routing in commercial data centers. Distributed multipath load-balancing algorithms of the subject invention help achieve high network throughput and short packet delay.

Embodiments of the subject invention can be applied for routing in modern DCNs to improve network throughput and reduce packet latency. Multipath routing is supported by allowing different flows to the same destination to take different routes, which is not possible with existing routing algorithms. The multipath capability of modern DCNs can be fully utilized, while achieving excellent (e.g., perfect) load balancing with only local information. Algorithms of the subject invention fully utilize the hierarchical characteristic of multi-rooted data center networks (including fat-tree networks), and are efficient with low time complexity.

To fully explore bandwidth, DCNs with multi-rooted topologies need practical and efficient multipath routing algorithms, which should satisfy one or more (ideally all) of the following design objectives. First, the algorithm should maximize bandwidth utilization (i.e., achieve a high routing success ratio), so that the same network hardware can accommodate as much traffic as possible. Second, the algorithm should achieve load balancing to inhibit the network from generating hot spots, and therefore avoid long queuing delays for the packets. Third, in order for the algorithm to be scalable and handle the large volume of traffic in DCNs, it must have low time complexity and make fast routing decisions.

In an embodiment, a depth-first worst-fit search-based algorithm can be used for traditional distributed networks. Depth-first search can be used to find a sequence of worst-fit links to connect the source and destination of a flow. Because DCN topologies are typically hierarchical, the depth-first search can quickly traverse between the hierarchical layers of switches to find a path. When there are multiple links to the neighboring layer, the worst-fit criterion selects the one with the most available bandwidth, and thus balances the traffic load among the links. By using the max-heap data structure, worst-fit can make the selection decision with constant time complexity. Also, worst-fit achieves load balancing, and therefore avoids unnecessary backtracking in the depth-first search and reduces packet queuing delay.

In an embodiment, a centralized algorithm can be used for flow networks, such as OpenFlow [17], by leveraging the centralized control framework. The central controller can be used to collect information from the entire network and make optimal routing decisions based on such information. The algorithm can first determine all the potential paths for a flow and find the bottleneck link of each path, which is the link with the minimum available bandwidth. The algorithm can then compare all the bottleneck links and select the path whose bottleneck link has the most available bandwidth. In this way, the algorithm can guarantee that the selected path minimizes the maximum link load of the entire network at the decision time, and therefore achieves load balancing.

In several embodiments, a multipath routing algorithm uses depth-first search to quickly find a path between hierarchical layers, and uses worst-fit to select links with low time complexity and avoid creating hot spots in the network. This is superior to the existing solutions discussed in the Background section.

In an embodiment, a DCN is modeled as a directed graph G = (H ∪ S, L), in which a node h ∈ H is a host, a node s ∈ S is a switch, and an edge (n_(i), n_(j)) ∈ L is a link connecting a switch with another switch or a host. Each edge (n_(i), n_(j)) has a nonnegative capacity c(n_(i), n_(j)) ≥ 0 indicating the available bandwidth of the corresponding link. There are n flows F₁, . . . , F_(n) in the DCN. F_(k) is defined as a triple F_(k) = (a_(k), b_(k), d_(k)), where a_(k) ∈ H is the source host, b_(k) ∈ H is the destination host, and d_(k) is the demanded bandwidth. Use f_(k)(n_(i), n_(j)) to indicate whether flow F_(k) is routed via link (n_(i), n_(j)).

The load-balancing objective function minimizes the maximum load among all the links, i.e.,

minimize maxload

subject to the following constraints:

$$\forall (n_{i}, n_{j}) \in L:\quad \sum_{k} f_{k}(n_{i}, n_{j})\, d_{k} \;\le\; c(n_{i}, n_{j})\cdot \mathrm{maxload} \;\le\; c(n_{i}, n_{j}) \qquad (1)$$

$$\forall k,\ \forall n_{i} \in (H \cup S) \setminus \{a_{k}, b_{k}\}:\quad \sum_{n_{j} \in H \cup S} f_{k}(n_{i}, n_{j}) \;=\; \sum_{n_{j} \in H \cup S} f_{k}(n_{j}, n_{i}) \qquad (2)$$

$$\forall k:\quad \sum_{n_{i} \in H \cup S} f_{k}(a_{k}, n_{i}) \;=\; \sum_{n_{i} \in H \cup S} f_{k}(n_{i}, b_{k}) \;=\; 1 \qquad (3)$$

Equation (1) defines maxload and states the link capacity constraint (i.e., the total demanded bandwidth on a link does not exceed its available bandwidth). Equation (2) states the flow conservation constraint (i.e., the amount of any flow does not change at intermediate nodes). Equation (3) states the demand satisfaction constraint (i.e., for any flow, the outgoing traffic at the source and the incoming traffic at the destination are equal to the demand of the flow).
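
As a concrete rendering of Equations (1)-(3), the sketch below expresses the optimization with the PuLP modeling library. The containers `nodes`, `links`, `cap`, and `flows` are hypothetical inputs assumed for illustration; with binary routing variables this is the integer program whose hardness is discussed next.

```python
import pulp

def build_model(nodes, links, cap, flows):
    """links: directed (ni, nj) pairs; cap[(ni, nj)]: available bandwidth;
    flows: list of (a_k, b_k, d_k) triples."""
    prob = pulp.LpProblem("load_balanced_multipath", pulp.LpMinimize)
    maxload = pulp.LpVariable("maxload", lowBound=0)
    f = {k: pulp.LpVariable.dicts("f%d" % k, links, cat="Binary")
         for k in range(len(flows))}
    prob += maxload                     # objective: minimize the maximum link load
    prob += maxload <= 1                # second inequality of constraint (1)
    for (ni, nj) in links:              # constraint (1): link capacity / maxload
        prob += pulp.lpSum(f[k][(ni, nj)] * d
                           for k, (_, _, d) in enumerate(flows)) <= cap[(ni, nj)] * maxload
    for k, (a_k, b_k, d_k) in enumerate(flows):
        for n in nodes:                 # constraint (2): flow conservation
            if n in (a_k, b_k):
                continue
            prob += (pulp.lpSum(f[k][(n, m)] for (x, m) in links if x == n)
                     == pulp.lpSum(f[k][(m, n)] for (m, x) in links if x == n))
        # constraint (3): demand satisfaction at source and destination
        prob += pulp.lpSum(f[k][(a_k, m)] for (x, m) in links if x == a_k) == 1
        prob += pulp.lpSum(f[k][(m, b_k)] for (m, x) in links if x == b_k) == 1
    return prob, f, maxload
```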

The load-balanced multipath routing problem can be proven to be NP-complete by reduction from the integer partition problem. The following theorem shows the NP-hardness of the studied problem.

Theorem 1: The load-balanced multipath routing problem is NP-hard for the fat tree topology.

Proof: The theorem is proven by reduction from the integer partition problem [18]. An integer partition problem decides whether a set of integers A = {a₁, . . . , a_(n)} can be partitioned into two subsets P and A\P such that the sum of elements in P is equal to that in A\P, i.e.,

$$\exists P \subseteq A:\quad \sum_{a_{i} \in P} a_{i} \;=\; \sum_{a_{i} \in A \setminus P} a_{i} \qquad (4)$$

To reduce the load-balanced problem from the above integer partition problem, consider an instance of the partition problem with set A; an instance of the load-balanced multipath routing problem can then be constructed under the fat-tree topology as follows. First, a 4-pod fat tree network is set up, in which each link has infinite bandwidth. The detail of one pod is shown in FIG. 2. The infinite link bandwidth satisfies the first constraint of the load-balancing problem. Next, two hosts attached to different edge switches but situated in the same pod are considered, as labeled 1 and 3 in FIG. 2. The former is the source while the latter is the destination. This means that the flows can only choose between two paths, path ACD and path ABD, as shown in FIG. 2, in order to reach the destination. There are n flows from the source to the destination, and their demands are given by the elements of set A, which is the instance of the partition problem.
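
For concreteness, a small illustrative instance of the reduction (not taken from the original disclosure): let

$$A = \{2, 3, 3, 4\}, \qquad P = \{3, 3\}, \qquad A \setminus P = \{2, 4\}, \qquad \sum_{a_i \in P} a_i = \sum_{a_i \in A \setminus P} a_i = 6.$$

Routing the two flows with demands 3 and 3 over path ABD and the flows with demands 2 and 4 over path ACD loads each path with exactly 6, i.e., half of the total demand of 12, which is precisely the perfectly balanced routing that a successful partition corresponds to.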

Assume that set A admits a successful partition, as shown in Equation (4). This implies that A can be split into two subsets with equal sums; assigning the flows whose demands form one subset to one path and the remaining flows to the other path minimizes the maximum load among the two paths, reducing the maximum load to half of the total demand. If a successful partition cannot be achieved, the unequal subsets would leave one path with under-loaded links while the other path faces congestion. This assignment also satisfies the remaining two constraints of the load-balancing problem.

In the other direction, consider that a perfectly load-balanced network is present in FIG. 2, i.e., the current maximum load in the network is equal to half of the total demand. Then the flows passing through path ABD have the same total demand as the flows passing through path ACD. Accordingly, for the integer partition problem, a subset A_(s) can be found whose elements correspond to the flows traversing path ABD. Thus, A_(s) and A\A_(s) have the same sum. Hence the load-balancing problem is NP-hard for the typical fat tree topology.

Theorem 2: The load-balanced multipath routing problem is NP-hard for the VL2 topology.

Proof: The proof is similar to the one discussed above, because in a VL2 network the connections between edge and aggregation switches are made in a similar fashion as in a 4-pod fat tree network, as seen in FIG. 2.

In several embodiments of the subject invention, an algorithm uses depth-first search. Depth-first search utilizes the hierarchical feature of DCN topologies to quickly find a path connecting the hierarchical layers. DCNs are typically organized in a hierarchical structure with multiple layers of switches and one layer of hosts [11]-[13]. For example, a fat tree [20] DCN has one layer each of hosts, edge switches, aggregation switches, and core switches, as shown in FIG. 1. Since a path typically has links connecting hosts and switches at different layers, depth-first search can quickly traverse these layers. For example, a path connecting two servers under the same edge switch in a fat tree will traverse from the host layer to the edge switch layer and then back to the host layer, as shown in FIG. 1. If the search has exhausted all the links in a layer and cannot proceed further, it is necessary to backtrack to the previous layer [8] and try the next candidate.

In several embodiments of the subject invention, an algorithm uses worst-fit. When there are multiple links to the neighboring layer, the worst-fit criterion can select the one with the most available bandwidth. On the other hand, the first-fit criterion ([4], [14]) selects the first (or leftmost) link with sufficient available bandwidth, and best-fit selects the link with the least but sufficient available bandwidth.

Compared with first-fit or best-fit, worst-fit has the following advantages. First, worst-fit has time complexity of O(1) by trying only the link with the largest available bandwidth. Since the controller needs to search a path on the fly for each flow, constant time complexity helps accelerate the routing process. In contrast, first-fit has time complexity of O(log N) to select from N candidates, where N grows with the DCN size, using the special winner tree data structure [15]. Similarly, best-fit has time complexity of O(log N) by conducting binary search on a pre-sorted list. Second, worst-fit achieves load balancing by evenly distributing traffic among all links, and therefore it needs fewer link selections on average to find a path. This characteristic also helps worst-fit find a path faster than first-fit and best-fit by avoiding excessive backtracking. As a comparison, first-fit and best-fit tend to consolidate traffic onto certain links and eventually block them. If all the neighboring links of a switch are blocked, the path searching has to backtrack to the previous layer, and thus needs more link selection decisions. Third, because worst-fit achieves load balancing, it is less likely to create hot spots in the network, avoiding long packet queuing delay. On the other hand, first-fit and best-fit keep increasing the load of a link until it is saturated. In this case, heavily loaded links suffer from extra latency, while some other links are still idle.
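
A minimal sketch of this worst-fit selection follows, assuming each switch (or the controller) keeps its candidate links in a max-heap keyed by remaining bandwidth. Python's heapq module is a min-heap, so bandwidths are negated; the peek that makes the selection decision is O(1), while updating the heap after charging the demand costs O(log N). The class and field names are illustrative assumptions.

```python
import heapq

class WorstFitLinks:
    def __init__(self, links):
        # links: iterable of (link_id, available_bandwidth)
        self.heap = [(-bw, lid) for lid, bw in links]
        heapq.heapify(self.heap)

    def select(self, demand):
        """Return the link with the most remaining bandwidth if it can carry
        `demand` (charging the demand against it); otherwise return None."""
        if not self.heap:
            return None
        neg_bw, lid = self.heap[0]        # O(1) peek at the emptiest link
        if -neg_bw < demand:
            return None                   # even the best candidate is too small
        heapq.heapreplace(self.heap, (neg_bw + demand, lid))  # O(log N) update
        return lid
```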

In several embodiments, an algorithm is a distributed algorithm. Because the algorithm runs in a distributed manner, each switch works independently. When a switch receives a new packet, the switch first checks whether the packet belongs to an existing flow by looking up the packet header. If yes, there is already an entry in the flow table for the existing flow, and the switch will forward the packet based on the information in the flow table. Otherwise, if the packet is the first one of a new flow, the switch will create a new entry for the flow in its flow table, and calculate the next hop. Then, the switch can determine whether the next hop should be in an upstream or downstream layer. Hosts in DCNs typically have IP addresses corresponding to their topological locations [6]. Therefore, based on the destination IP address, the switch can find by which layer the source and destination hosts can be connected. For example, in a fat tree based DCN, hosts in the same pod typically share the same subnet address [20], and a flow between hosts in different pods has to go through a core switch. Thus, the flow will first go upstream until it reaches a core switch, and then head back downstream until it arrives at the destination. Next, if there are multiple links to the next layer, the switch compares the load values of the links and selects the worst-fit one. In the case that there is no viable link to the next layer, the switch will send the packet back to the previous hop for backtracking.
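
A compact sketch of this per-switch behavior is given below. The helpers `direction_for`, `worst_fit_link`, `forward`, and `send_back`, as well as the packet fields, are assumptions made for illustration; they are not interfaces defined in the original disclosure.

```python
def handle_packet(switch, pkt):
    key = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port)
    out_link = switch.flow_table.get(key)
    if out_link is not None:                      # existing flow: reuse its link
        switch.forward(pkt, out_link)
        return
    # New flow: decide whether the next hop is upstream or downstream,
    # then pick the worst-fit link toward that layer.
    direction = switch.direction_for(pkt.dst_ip)  # based on the destination IP
    link = switch.worst_fit_link(direction, pkt.demand)
    if link is None:                              # no viable link: backtrack
        switch.send_back(pkt)
        return
    switch.flow_table[key] = link                 # record the new entry
    switch.forward(pkt, link)
```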

In order to make optimal routing decisions for each flow and to achieve distributed control over the network, each switch can maintain a flow table. An entry in the flow table maintained at each switch in the network can include the source address, source port number, destination address, destination port number, and the outgoing link on which the flow is assigned [17]. A separate entry is made for each flow that arrives at a particular switch. Whenever a new flow arrives at a switch, the switch can create an entry in the flow table for the new flow. After a link is selected for the flow, it is added to the entry as the outgoing link for that particular flow. The flow table in the switch can help determine whether the packet should be treated as a packet of a new flow, a packet of an existing flow, or a backtracked packet. If the received packet is from a new flow, no entry would be present in the flow table, and the algorithm will start searching for a link for this flow. If the packet is from an existing flow, the switch will already have an entry and will send the packet to its outgoing link after looking at the entry in its flow table. A packet is treated as a backtracked packet if it is received from its outgoing link, i.e., the link on which this packet was previously assigned, as the switch will already have an entry for that packet. From the discussion in this and the previous paragraph, an algorithm of the subject invention plainly fulfills the design objectives discussed above. First, with the help of link availability searching and backtracking, it can find a path if one exists. Second, with the help of link load comparisons, it guarantees a load-balanced network.
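
The flow table entry and the three-way packet classification described above might be sketched as follows; the field names and the tuple key are illustrative assumptions rather than the exact record format of the disclosure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FlowEntry:
    src_addr: str
    src_port: int
    dst_addr: str
    dst_port: int
    out_link: Optional[int] = None   # outgoing link the flow was assigned to

class FlowTable:
    def __init__(self):
        self.entries = {}            # keyed by (src_addr, src_port, dst_addr, dst_port)

    def classify(self, pkt_key, in_link):
        """Return 'new', 'existing', or 'backtracked' for an arriving packet."""
        entry = self.entries.get(pkt_key)
        if entry is None:
            return "new"             # no entry yet: start searching for a link
        if entry.out_link == in_link:
            return "backtracked"     # packet came back on its own outgoing link
        return "existing"            # forward on the recorded outgoing link
```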

In several embodiments, an algorithm is a centralized load-balanced multipath routing algorithm. It can be implemented on, e.g., the OpenFlow protocol. A centralized controller can be used to collect load information of all the links in the network, and make a globally optimal decision. When a new flow comes, the controller can enumerate all the possible paths, compare the loads of their bottleneck links, and select the one with the minimum load.

To optimize bandwidth utilization in DCNs, it is important to have a global view of the available resources and requests in the network [5]. A central controller can be utilized for this purpose and can communicate with switches in the DCN (e.g., by the OpenFlow protocol) [27]. Each OpenFlow enabled switch has a flow table to control flows, where a flow can be flexibly defined by any combination of the ten packet headers at arbitrary granularity [17]. The controller can control the flows by querying, inserting, or modifying entries in the flow tables. In this way, the central controller can collect bandwidth and flow information of the entire network from the switches, make optimal routing decisions, and send the results back to the switches to enforce the planned routing. Multiple choices of OpenFlow devices are already available on the market [21], [22], [24]-[26], and OpenFlow has been adopted in many recent data center designs [4], [5], [14], [20]. The availability of OpenFlow switches makes it practical to quickly experiment with and deploy algorithms of the subject invention.

In an embodiment of a centralized load-balanced multipath routing algorithm, in the first step, the algorithm checks which layer the path needs to go through based on the locations of the source and destination hosts. If they are in different pods, then the path needs to go through a core switch. If they are attached to the same edge switch, then the edge switch will connect them. Otherwise, if they are in the same pod but not under the same edge switch, an aggregation switch is necessary to connect them. In the second step, the central controller determines the bottleneck link of each potential path corresponding to a different connecting layer switch. Note that in a fat tree network, once the connecting layer switch is determined, the path from the source to the destination is determined as well. The central controller has the information of every link in the network, and this step can be done quickly by comparing the loads of all the links on a path. The link with the smallest available bandwidth is called the bottleneck link of the path. In the third step, the controller compares the available bandwidth of the bottleneck links of all the potential paths and finds the one with the maximum. If the maximum bandwidth is greater than the demand of the flow, then the corresponding path is selected for the flow. The controller can then set up the flow tables of all the switches on the path accordingly. Otherwise, if the maximum bottleneck link bandwidth is less than the flow demand, there does not exist a viable path for this flow.
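
The three steps can be summarized in the sketch below. `paths_via(src, dst)` is an assumed helper that enumerates one candidate path per connecting-layer switch, and `avail[(a, b)]` is the controller's view of remaining link bandwidth; both are illustrative stand-ins rather than defined interfaces.

```python
def centralized_select(src, dst, demand, avail, paths_via):
    best_path, best_bottleneck = None, float("-inf")
    for path in paths_via(src, dst):          # one candidate per connecting-layer switch
        # Bottleneck: the link with the smallest available bandwidth on this path.
        bottleneck = min(avail[(a, b)] for a, b in zip(path, path[1:]))
        if bottleneck > best_bottleneck:
            best_path, best_bottleneck = path, bottleneck
    if best_path is None or best_bottleneck < demand:
        return None                           # no viable path for this flow
    for a, b in zip(best_path, best_path[1:]):
        avail[(a, b)] -= demand               # charge the accepted flow's demand
    return best_path
```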

As discussed above, the centralized algorithm can run in the OpenFlow controller, which has a global view of the entire network. After the algorithm successfully finds a path, it will deduct the flow demand from the available bandwidth of the links in the path and update the heaps. Fortunately, paths in DCNs typically have a small number of hops, so the process can finish quickly. From the discussion in this and the previous few paragraphs, a centralized algorithm of the subject invention plainly fulfills the design objectives discussed above. First, it achieves high bandwidth utilization by exhaustive search, and can guarantee to find a path if one exists. Second, excellent (e.g., perfect) load balancing can be achieved or even guaranteed by selecting the path whose bottleneck link has the most available bandwidth. Third, the logarithmic comparison operation can achieve or even guarantee low time complexity and fast routing decisions.

DCNs often rely on multipath capability for increased bandwidth and fault tolerance. Embodiments of the subject invention provide load-balanced multipath routing algorithms to fully and efficiently utilize the available bandwidth in DCNs. The problem can be formulated as a linear program and shown to be NP-complete for typical DCN topologies by reduction from the integer partition problem. The NP-completeness proof indicates that the problem is unlikely to have efficient polynomial-time solutions, and therefore only approximation solutions may be practical. In various embodiments, a distributed algorithm can be used for traditional networks, and a centralized algorithm can be used for flow networks. A distributed algorithm can use depth-first search to quickly traverse between the hierarchical layers, and can adopt the worst-fit link selection criterion to achieve load balancing and low time complexity. The centralized algorithm can rely on the centralized control framework of the OpenFlow protocol, and can collect complete information of the network and make optimal load balancing decisions.

Large scale simulations of the algorithms of the subject invention, using the NS-3 simulator, demonstrate the effectiveness and scalability of the designs of the algorithms of the subject invention. This is true for both distributed and centralized algorithms of the subject invention.

In several embodiments, algorithms (both distributed and centralized) advantageously rely on network devices and can be transparent to hosts. Algorithms of the subject invention have at least the following advantages. First, data center customers can run their software on commodity operating systems without compromising security or compatibility. Second, users can enjoy the benefits of load-balanced multipath routing even if their operating systems are not open-source and consequently not customizable. Third, algorithms of the subject invention have lower maintenance costs because operating systems and hosts upgrade more frequently than network devices do.

The methods and processes described herein can be embodied as code and/or data. The software code and data described herein can be stored on one or more computer readable media, which may include any device or medium that can store code and/or data for use by a computer system. When a computer system reads and executes the code and/or data stored on a computer-readable medium, the computer system performs the methods and processes embodied as data structures and code stored within the computer-readable storage medium.

It should be appreciated by those skilled in the art that computer-readable media include removable and non-removable structures/devices that can be used for storage of information, such as computer-readable instructions, data structures, program modules, and other data used by a computing system/environment. A computer-readable medium includes, but is not limited to, volatile memory such as random access memories (RAM, DRAM, SRAM); and non-volatile memory such as flash memory, various read-only-memories (ROM, PROM, EPROM, EEPROM), magnetic and ferromagnetic/ferroelectric memories (MRAM, FeRAM), and magnetic and optical storage devices (hard drives, magnetic tape, CDs, DVDs); network devices; or other media now known or later developed that are capable of storing computer-readable information/data. Computer-readable media should not be construed or interpreted to include any propagating signals.

EXAMPLES

Following are examples that illustrate procedures for practicing the invention. These examples should not be construed as limiting.

Example 1

Experimental performance results were obtained from a real testbed, implemented using the Beacon OpenFlow controller, HP ProCurve OpenFlow switches, VMware vCenter server, and VMware ESXi hypervisor.

A 4-pod and 16-host fat-tree prototype was built to demonstrate the effectiveness and practicality of the optimization algorithm in real networks. Two OpenFlow-enabled 48-port HP ProCurve 6600 switches running firmware version K.15.06.5008 were utilized, and 20 virtual switches were created. Each virtual switch was assigned 4 ports, except that the first core layer switch had 3 extra ports to allow connections for management nodes, including the VMware vCenter server, Network File System (NFS) server, and DHCP server. All switches were managed by the Beacon OpenFlow controller version 1.0.0 with a self-developed Equinox framework bundle that implemented the optimization algorithm. Each host was running VMware ESXi hypervisor version 5.0.0 to host VMs running the Ubuntu Server 12.04.1 LTS 64-bit operating system. The hosts and VMs were configured to request IP addresses upon startup through the DHCP protocol. When the controller detected the DHCP discovery message sent by a host or a VM, it recorded the host's or the VM's MAC address and location based on which input port of which ToR switch received the message. The IP address of the host or VM was updated when the controller detected the DHCP offer message. All hosts and VMs were remotely managed by VMware vCenter server version 5.0.0. Each VM's file system was provided by an NFS server implemented on a Linux PC running Ubuntu version 12.04.

Iperf UDP flows were employed to emulate the production traffic in data centers. The controller assigned initial routing paths to the flows. The initial routing paths were calculated by using a shortest path routing algorithm. If multiple routing paths from the source to the destination existed, the controller selected one of them randomly. For each switch on the routing path, the controller also calculated each flow's input and output ports. The controller installed the flow table entries to all the switches on the routing path by sending them ofp_flow_mod messages with the flow's match information and the calculated input and output ports.

Experiments were conducted on six algorithms as follows: CLB (centralized load balancing), DLB (distributed load balancing), SLB (static load balancing), NLB (no load balancing, with a hash function), OSPF (link-state routing), and RIP (distance-vector routing). CLB and DLB are routing algorithms according to embodiments of the subject invention, and the other four are comparison algorithms. SLB performs flow routing based on pre-defined decisions, which distribute uniform traffic evenly on every link in the network to achieve load balancing. NLB randomly picks one of all available paths between hosts based on a hash function. The OSPF routing was simulated on the OpenFlow controller, with the cost of each link computed as Cost = Reference/Bandwidth = 1000000 Kbps/1 Gbps = 0.1. The OpenFlow controller calculated the shortest distance path for a flow. If there were multiple shortest distance paths, the controller picked one path randomly. The RIP routing was simulated on the OpenFlow controller. To solve looping problems, a spanning tree was manually created by shutting down links in the fat tree network.

To test each routing algorithm's performance, two types of traffic were used: non-uniform and uniform. The 16 hosts were named host 1 to host 16. Also, the 4 pods were named pod 1 to pod 4. Each pod contained 4 hosts in ascending order. A flow is a sequence of packets transmitted from an Iperf client to an Iperf server. In non-uniform traffic, there were four groups of flows: in group (1), hosts 1, 5, 9 were clients and all hosts in pod 4 were servers; in group (2), hosts 2, 6, 14 were clients and all hosts in pod 3 were servers; in group (3), hosts 3, 11, 15 were clients and all hosts in pod 2 were servers; in group (4), hosts 8, 12, 16 were clients and all hosts in pod 1 were servers. Within each group, the selection of Iperf clients followed a round-robin fashion. The selection of Iperf servers was stochastic. Overall, there were 12 hosts acting as clients and 16 hosts acting as servers. For each routing algorithm, nine tests were conducted as the number of flows each client generates ranges from 1 to 9, mapping to loads 0.1 to 0.9. In uniform traffic, each host had the same probability of being selected as Iperf client as well as server. However, one restriction was that one UDP flow's client and server could not be the same host. Load was measured as the total bandwidth of all flows generated by all clients divided by the number of clients. The performance of each algorithm was tested with loads from 0.1 to 0.9.

End-to-end packet delay of different routing algorithms was examined. End-to-end packet delay measures the time from when the packet is sent by the source to the time it is received by the destination. Short end-to-end delay indicates that the network is not congested, and vice versa.

FIG. 3 shows the average end-to-end delay under non-uniform traffic. In FIG. 3, CLB is the blue line (with plus signs), DLB is the green line (with asterisks), SLB is the red line (with X's), NLB is the teal line (with circles), OSPF is the purple line (with diamonds), and RIP is the tan line (with triangles). Referring to FIG. 3, CLB and DLB significantly outperform the remaining algorithms, showing that they achieve better load balance. In particular, CLB has slightly shorter delay than DLB, which demonstrates the superiority of the centralized control framework. By contrast, RIP has the longest end-to-end delay, because it does not support multipathing, and thus cannot utilize the multi-rooted fat tree topology. SLB has the second longest delay, since its static load balancing mode suffers from the non-uniform traffic pattern. Finally, NLB and OSPF, which have only limited support for multipath load balancing, perform better than RIP and SLB but worse than CLB and DLB.

FIG. 4 shows the average end-to-end delay under uniform traffic. In FIG. 4, CLB is the blue line (with plus signs), DLB is the green line (with asterisks), SLB is the red line (with X's), NLB is the teal line (with circles), OSPF is the purple line (with diamonds), and RIP is the tan line (with triangles). Referring to FIG. 4, a result similar to that with the non-uniform traffic is observed. CLB and DLB have the shortest delay, RIP has the longest delay, and NLB and OSPF are in the middle. However, the performance of SLB is significantly improved, because its static load balancing mode works better under the uniformly distributed traffic.

Next, the network throughput of different routing algorithms was examined. Network throughput is the ratio of the traffic successfully delivered to the total amount of traffic sent by all hosts; more severe congestion leads to lower throughput.

FIG. 5 shows the average network throughput under the non-uniform traffic. In FIG. 5, CLB is the blue line (with plus signs), DLB is the green line (with asterisks), SLB is the red line (with X's), NLB is the teal line (with circles), OSPF is the purple line (with diamonds), and RIP is the tan line (with triangles). Referring to FIG. 5, CLB and DLB have the highest network throughput (i.e., the least congestion). Even if the traffic load is 0.9, the throughput for CLB and DLB can reach about 85%. Consistently, CLB performs slightly better than DLB, benefiting from the centralized control and more optimized routing. RIP and SLB have the lowest throughput, because the former does not support multipathing and the latter has a static load balancing mode not compatible with the non-uniform traffic pattern. Again, NLB and OSPF have better performance than RIP and SLB, but worse than CLB and DLB.

FIG. 6 shows the average network throughput under the uniform traffic. In FIG. 6, CLB is the blue line (with plus signs), DLB is the green line (with asterisks), SLB is the red line (with X's), NLB is the teal line (with circles), OSPF is the purple line (with diamonds), and RIP is the tan line (with triangles). Referring to FIG. 6, the results are similar to those under the non-uniform traffic, except that the performance of SLB improves significantly due to the uniformly distributed traffic pattern. However, CLB and DLB still have the highest network throughput.

Algorithms were implemented in a real testbed and in the NS-3 simulator. Both the experiment and simulation results demonstrated that algorithms of the subject invention (e.g., CLB and DLB) significantly outperform existing solutions on network throughput and packet delay.

All patents, patent applications, provisional applications, and publications referred to or cited herein, including those listed in the “References” section, are incorporated by reference in their entirety, including all figures and tables, to the extent they are not inconsistent with the explicit teachings of this specification.

It should be understood that the examples and embodiments described herein are for illustrative purposes only and that various modifications or changes in light thereof will be suggested to persons skilled in the art and are to be included within the spirit and purview of this application.

REFERENCES

- [1] “Who Has the Most Web Servers?” http://www.datacenterknowledge.com/archives/2009/10/13/facebook-now-has-30000-servers/.
- [2] M. Al-Fares, A. Loukissas, and A. Vahdat, “A scalable, commodity data center network architecture,” in ACM SIGCOMM, Seattle, Wash., August 2008.
- [3] M. Al-Fares and A. V. Alexander L., “A scalable, commodity data center network architecture,” Dept. of Computer Science and Engineering, University of California, San Diego, Tech. Rep.
- [4] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat, “Hedera: dynamic flow scheduling for data center networks,” in USENIX NSDI, San Jose, Calif., April 2010.
- [5] T. Benson, A. Anand, A. Akella, and M. Zhang, “The case for fine-grained traffic engineering in data centers,” in USENIX INM/WREN, San Jose, Calif., April 2010.
- [6] K. Chen, C. Guo, H. Wu, J. Yuan, Z. Feng, Y. Chen, S. Lu, and W. Wu, “Generic and automatic address configuration for data center networks,” in ACM SIGCOMM, New Delhi, India, August 2010.
- [7] K. Chen, C. Hu, X. Zhang, K. Zheng, Y. Chen, and A. V. Vasilakos, “Survey on routing in data centers: Insights and future directions,” Tech. Rep., July/August 2011.
- [8] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, 3rd ed. MIT Press, 2009.
- [9] “IP Multicast Load Splitting—Equal Cost Multipath (ECMP) Using S, G and Next Hop,” http://www.cisco.com/en/US/docs/ios/12 2sr/12 2srb/feature/guide/srbmpath.htmlg.
- [10] C. Fraleigh, S. Moon, B. Lyles, C. Cotton, M. Khan, D. Moll, R. Rockell, T. Seely, and S. C. Diot, “Packet-level traffic measurements from the Sprint IP backbone,” IEEE Network, vol. 17, no. 6, pp. 6-16, November 2003.
- [11] A. Greenberg, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta, “VL2: a scalable and flexible data center network,” in ACM SIGCOMM, Barcelona, Spain, August 2009.
- [12] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu, “BCube: a high performance, server-centric network architecture for modular data centers,” in ACM SIGCOMM, Barcelona, Spain, August 2009.
- [13] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu, “DCell: a scalable and fault-tolerant network structure for data centers,” in ACM SIGCOMM, Seattle, Wash., August 2008.
- [14] B. Heller, S. Seetharaman, P. Mahadevan, Y. Yiakoumis, P. Sharma, S. Banerjee, and N. McKeown, “ElasticTree: saving energy in data center networks,” in USENIX NSDI, San Jose, Calif., April 2010.
- [15] D. Johnson, “Fast algorithms for bin packing,” Journal of Computer and System Sciences, vol. 8, no. 3, pp. 272-314, June 1974.
- [16] J. Kurose and K. Ross, Computer Networking: A Top-Down Approach, 4th ed. Addison Wesley, 2007.
- [17] N. McKeown, S. Shenker, T. Anderson, L. Peterson, J. Turner, H. Balakrishnan, and J. Rexford, “OpenFlow: enabling innovation in campus networks,” ACM SIGCOMM Computer Communication Review, vol. 38, no. 2, pp. 69-74, April 2008.
- [18] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness.
- [19] J. Mudigonda and P. Yalagandula, “SPAIN: COTS data-center Ethernet for multipathing over arbitrary topologies,” in USENIX NSDI, San Jose, Calif., April 2010.
- [20] R. N. Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat, “PortLand: a scalable fault-tolerant layer 2 data center network fabric,” in ACM SIGCOMM, Barcelona, Spain, August 2009.
- [21] J. Naous, D. Erickson, A. Covington, G. Appenzeller, and N. McKeown, “Implementing an OpenFlow switch on the NetFPGA platform,” in ACM/IEEE ANCS, San Jose, Calif., November 2008.
- [22] “Cisco OpenFlow Switches,” http://blogs.cisco.com/tag/openflow/.
- [23] “GENI OpenFlow Backbone Deployment at Internet2,” http://groups.geni.net/geni/wiki/OFI2.
- [24] “HP OpenFlow Switches,” http://h30507.www3.hp.com/t5/HP-Networking/Take-control-of-the-network-OpenFlow-a-new-tool-for-building-and/ba-p/92201.
- [25] “NEC OpenFlow Switches,” http://www.necam.com/pflow/.
- [26] “OpenFlow 1.0 Release,” http://www.openflowswitch.org/wk/index.php/OpenFlow v1.0.
- [27] “The OpenFlow Consortium,” http://www.openflowswitch.org.
- [28] D. Pan and Y. Yang, “Localized independent packet scheduling for buffered crossbar switches,” IEEE Transactions on Computers, vol. 58, no. 2, pp. 260-274, February 2009.
- [29] “OpenFlow Slicing,” http://www.openflowswitch.org/wk/index.php/Slicing.
- [30] G. Wang, D. G. Andersen, M. Kaminsky, M. Kozuch, T. S. E. Ng, K. Papagiannaki, and M. Ryan, “c-Through: part-time optics in data centers,” in ACM SIGCOMM, New Delhi, India, August 2010.

What is claimed is:
 1. A method of load balancing in a network, comprising: receiving, by a switch, a packet; looking up, by the switch, a packet header of the packet to check whether the packet belongs to an existing flow; if the packet belongs to an existing flow, forwarding the packet based on information in a flow table of the switch, and otherwise, creating a new entry in the flow table for the packet and calculating the next hop; determining if the next hop is an upstream or downstream layer of the network based on a destination IP address; and comparing load values of links to the next layer and selecting a worst-fit link.
 2. The method according to claim 1, wherein, if no viable link to the next layer exists, the method further comprises sending, by the switch, the packet back to the previous hop for backtracking.
 3. The method according to claim 1, wherein each entry in the flow table of the switch includes at least one piece of information selected from the group consisting of: source address; source port number; destination address; destination port number; and outgoing link on which the flow is assigned.
 4. The method according to claim 1, wherein each entry in the flow table of the switch includes all of the following pieces of information: source address; source port number; destination address; destination port number; and outgoing link on which the flow is assigned.
 5. The method according to claim 1, wherein the network is a fat tree based data center network.
 6. The method according to claim 1, wherein the flow table of the switch determines whether the packet is treated as part of a new flow, treated as part of an existing flow, or backtracked.
 7. The method according to claim 1, wherein the network is represented by a model, such that the network is modeled as a directed graph G = (H ∪ S, L), wherein a node h ∈ H is a host, wherein a node s ∈ S is a switch, and wherein an edge (n_(i), n_(j)) ∈ L is a link connecting a switch with another switch or a host.
 8. A method of load balancing in a network, comprising: checking which layer of the network a packet should go through based on locations of a source host of the network and a destination host of the network; determining, by a central controller, a bottleneck link of each potential path corresponding to a different connecting layer switch; comparing, by the central controller, the available bandwidth of all the potential paths and finding the path with the maximum bandwidth; and if the maximum bandwidth is greater than a demand of a flow of the network, then selecting the corresponding path for the flow, and otherwise, determining that no viable path exists for the packet.
 9. The method according to claim 8, further comprising, after determining the bottleneck link of each potential path corresponding to a different connecting layer switch, determining a path from the source host to the destination host.
 10. The method according to claim 8, wherein checking which layer of the network the packet should go through comprises determining which top layer of the network the packet should go through.
 11. The method according to claim 8, wherein the network is a fat tree based data center network.
 12. A system for load balancing a network, wherein the system comprises one of the following: a) a switch configured to: receive a packet; look up a packet header of the packet to check whether the packet belongs to an existing flow; if the packet belongs to an existing flow, forward the packet based on information in a flow table of the switch, and otherwise, create a new entry in the flow table for the packet and calculate the next hop; determine if the next hop is an upstream or downstream layer of the network based on a destination IP address; and compare load values of links to the next layer and select a worst-fit link; or b) a central controller configured to: check which layer of the network a packet should go through based on locations of a source host of the network and a destination host of the network; determine a bottleneck link of each potential path corresponding to a different connecting layer switch; compare the available bandwidth of all the potential paths and find the path with the maximum bandwidth; and if the maximum bandwidth is greater than a demand of a flow of the network, then select the corresponding path for the flow, and otherwise, determine that no viable path exists for the packet.
 13. The system according to claim 12, wherein the system comprises the switch, and wherein the switch is further configured to send the packet back to the previous hop for backtracking if no viable link to the next layer exists.
 14. The system according to claim 12, wherein the system comprises the switch, and wherein each entry in the flow table of the switch includes at least one piece of information selected from the group consisting of: source address; source port number; destination address; destination port number; and outgoing link on which the flow is assigned.
 15. The system according to claim 12, wherein the system comprises the switch, and wherein the flow table of the switch determines whether the packet is treated as part of a new flow, treated as part of an existing flow, or backtracked.
 16. The system according to claim 12, wherein the system comprises the switch, and wherein the network is represented by a model, such that the network is modeled as a directed graph G = (H ∪ S, L), wherein a node h ∈ H is a host, wherein a node s ∈ S is a switch, and wherein an edge (n_(i), n_(j)) ∈ L is a link connecting a switch with another switch or a host.
 17. The system according to claim 12, wherein the system comprises the central controller, and wherein the central controller is further configured to determine a path from the source host to the destination host after determining the bottleneck link of each potential path corresponding to a different connecting layer switch.
 18. The system according to claim 12, wherein the system comprises the central controller, and wherein checking which layer of the network the packet should go through comprises determining which top layer of the network the packet should go through.
 19. The system according to claim 12, further comprising a computer-readable medium having the model representing the network stored thereon.
 20. The system according to claim 12, further comprising a processor in operable communication with the switch.